Add GLM 5 MTP by SamuelOliveirads · Pull Request #1513 · ikawrakow/ik_llama.cpp

SamuelOliveirads · 2026-03-25T21:47:57Z

Adds MTP support for GLM-5. To try it, pass -mtp to enable it, and use --draft-max and --draft-p-min to control how many tokens are drafted.

Tests applied

Test 1: Write a quick sort python algorithm, answer only the code.
Test 2: Extract all core events with their exact dates into a bulleted list. For this test I copied the "Top" YouTube section from Wikipedia: https://en.wikipedia.org/wiki/YouTube#
Test 3: Write an unexpected short story about someone exploring a cyberpunk city in 2077, but the main character's internal dialogue is deeply analytical and philosophical.

GLM 5 smol-IQ2_KS - Draft size = 10, p-min = 0.85, -ot "blk.78..*=CUDA1", --seed 42

Without MTP vs With MTP

| Prompt | Baseline (t/s) | MTP (t/s) | Accept rate (%) | Difference (%) |
|---|---|---|---|---|
| Quicksort python | 8.18 | 6.90 | 62.2 | -15.65 |
| Test reasoning | 8.23 | 5.36 | 57.8 | -34.87 |
| Creative writing | 8.13 | 5.11 | 50.8 | -37.15 |
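For anyone re-running these tests, the Difference column is simply the relative change of the MTP throughput against the baseline. A minimal Python sketch that reproduces it from the figures in the table (the helper name `relative_difference` is only illustrative, not part of the PR):

```python
# Throughput figures (tokens/s) copied from the table above: (baseline, MTP).
results = {
    "Quicksort python": (8.18, 6.90),
    "Test reasoning": (8.23, 5.36),
    "Creative writing": (8.13, 5.11),
}


def relative_difference(baseline_tps: float, mtp_tps: float) -> float:
    """Percent change of MTP throughput relative to the baseline."""
    return (mtp_tps - baseline_tps) / baseline_tps * 100.0


for prompt, (baseline, mtp) in results.items():
    print(f"{prompt}: {relative_difference(baseline, mtp):+.2f}%")
```

A negative value means MTP is slower than the baseline for that prompt.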

Ports the Multi-Token Prediction (MTP) architecture to the older `llama.cpp` codebase used by `ik_llama.cpp`. Changes include:

- Updating `llama_batch` to support `mtp_params`.
- Modifying `llama_decode_internal` (and `encode`) to handle MTP operations (Warmup, Update, Draft).
- Adding public APIs for MTP state management (`llama_set_draft_input_hidden_state`).
- Adapting the embedding extraction logic to skip MTP update passes.

jukofyork · 2026-03-26T16:11:04Z

> @jukofyork Are there any standardized tests to check an LLM's tool-call performance etc. that can be run locally?

Not really, but you will very quickly find out if it starts hallucinating the tool calls in the chat (deepseek is particularly bad and fails almost instantly).

magikRUKKOLA · 2026-03-26T16:14:16Z

@jukofyork

Okay, what about using smol-IQ1_KT, fully GPU-offloaded, as a draft for a larger quant with only the head and the KV cache offloaded?

I'm getting about 31 tps decode at zero ctx and 21 tps at 32k ctx. [EDIT]: Naaah, I don't think it's worth it. 21 tps at 32k ctx is already slow enough. Hmm... I should probably finally try with a dual EPYC.

SamuelOliveirads · 2026-03-26T16:32:32Z

> I think it would be better to first achieve performance improvement via MTP before adding MTP for more models.

@ikawrakow To be honest, I already had GLM5 and use it fairly often, so I wanted to add it to have a point of comparison. As for other MTP models, I don't plan on adding them for now, especially since we don't retain the layer and it's unlikely anyone would want to re-quantize just to test a slow feature.

> Have you tried -mla 1 (assuming you used -mla 3)?

@jukofyork With MLA 1 or 3 I saw slightly lower performance; for me the ranking was: no MLA > MLA3 > MLA1. To be honest, I haven't fine-tuned the arguments for a while, but since you mentioned -draft-min, I have an idea that might help define that parameter better. I'll see how it works in practice later.

> /opt/ik_llama.cpp/ik_llama.cpp/src/llama.cpp:4087: GGML_ASSERT(lctx.embd != nullptr) failed

@magikRUKKOLA Could you give me some details about the arguments used? I tested it with Kimi K2.5, thinking there was an incompatibility with MTP, then I tested it with GLM5 without MTP and didn't get any errors.

magikRUKKOLA · 2026-03-26T16:50:23Z

@SamuelOliveirads

> Could you give me some details about the arguments used?

/opt/ik_llama.cpp/ik_llama.cpp/build/bin/llama-server \
    --model /opt/ubergarm/GLM-5-GGUF/smol-IQ2_KS/GLM-5-smol-IQ2_KS-00001-of-00006.gguf \
    --alias ubergarm/GLM-5-smol-IQ2_KS \
    --ctx-size $((128 * 1024)) \
    -b $((1024)) -ub $((1024)) \
    --mlock \
    --temp 0.0 --top-p 1.0 --top-k 0 \
    -ctk q6_0 \
    -ctv q6_0 \
    -mtp \
    -khad \
    -ger \
    -smgs \
    -sas \
    -muge \
    -mea 16 \
    -amb 16 \
    --merge-qkv \
    --graph-reduce-type bf16 \
    --split-mode layer \
    --main-gpu 0 \
    --max-gpu 0 \
    --n-gpu-layers 99 \
    --threads $(grep ^cpu\\scores /proc/cpuinfo | uniq | awk '{print $4}' | xargs -I{} echo "{}-0" | bc) \
    --host 0.0.0.0 \
    --port 8080 \
    --log-enable \
    --logdir /var/log/ \
    --jinja \
    --special \
    --verbosity 1 \
    --verbose-prompt \
    --reasoning-format auto \
    --prompt-cache "$HOME/.cache/ik_llama.cpp/prompt-cache.bin" --prompt-cache-all \
    --slot-save-path "$HOME/.cache/ik_llama.cpp/slot.bin" \
    --lookup-cache-dynamic "$HOME/.cache/ik_llama.cpp/slot.bin" \
    --keep -1 \
    --slot-prompt-similarity 0.35 \
    --metrics \
    -cuda fusion=1

[EDIT]: woops. I had to use --threads 1. But that would not matter much anyway.

jukofyork · 2026-03-26T17:26:57Z

> @jukofyork With MLA 1 or 3 I saw slightly lower performance; for me the ranking was: no MLA > MLA3 > MLA1. To be honest, I haven't fine-tuned the arguments for a while, but since you mentioned -draft-min, I have an idea that might help define that parameter better. I'll see how it works in practice later.

On mainline llama.cpp I found the best thing to do is run a sweep of all batch sizes from 1 to 64 and plot them. You often see things in the first 2-8 batch sizes that help tune the draft parameters.

magikRUKKOLA · 2026-03-26T18:07:47Z

@SamuelOliveirads

[EDITED]:

/opt/ubergarm/Kimi-K2.5-GGUF/smol-IQ1_KT:

WARN [              load_model] WARNING: -mtp flag provided, but model has 0 NextN layers. MTP will be disabled.

magikRUKKOLA · 2026-03-26T18:15:01Z

> GLM 5 smol-IQ2_KS - Draft size = 10, p-min = 0.85, -ot "blk.78..*=CUDA1", --seed 42

Which arguments should I use again? How do I set the draft size?

[EDIT]: Oh, I see. It's set via --draft-max, which is 16 by default.

magikRUKKOLA · 2026-03-26T18:30:16Z

> I tested it with Kimi K2.5, thinking there was an incompatibility with MTP, then I tested it with GLM5 without MTP and didn't get any errors.

It's with -mtp provided, for GLM5.

SamuelOliveirads · 2026-03-26T20:29:16Z

@magikRUKKOLA I wasn't able to reproduce the same error with your arguments; the only difference was that I couldn't fully offload such a large model to the GPU. That said, some errors did occur, and they were fixed by the most recent rebase of the branch. Since your first test was done before that, please pull again and retry.

To provide more context, the models that have MTP and support it are GLM 4.5/4.6/4.7 and 5.0. You can pass -mtp with any other model and it will simply be disabled (I used Kimi K2.5 to check whether this logic was causing your crash).

Currently, MTP only supports --draft-max and --draft-p-min.

> On mainline llama.cpp I found the best thing to do is run a sweep of all batch sizes from 1 to 64 and plot them. You often see things in the first 2-8 batch sizes that help tune the draft parameters.

@jukofyork I believe that certain parameters, such as draft-max, draft-min, and p-min, could be optimized, perhaps using a controller that adjusts them based on the hit rate of the speculative model. Since you're running some tests, are there any parameters you'd like me to test?

magikRUKKOLA · 2026-03-26T20:37:18Z

@SamuelOliveirads

Aha! Yes, it indeed no longer crashes.
With -mtp it's a lot slower. I will publish the results for the first test.

It's like molasses, yeah.

VERB [speculative_decoding_accept] speculative decoding result | tid="140439775965184" timestamp=1774557499 id_slot=0 accepted=1 total=0 new_n_past=1562
VERB [            update_slots] run slots completed | tid="140439775965184" timestamp=1774557499
VERB [              start_loop] wait for new task | tid="140439775965184" timestamp=1774557499
VERB [              start_loop] new task may arrive | tid="140439775965184" timestamp=1774557499
slot print_timing: id  0 | task 184 | 
prompt eval time =      56.74 ms /     1 tokens (   56.74 ms per token,    17.62 tokens per second)
       eval time =  129752.55 ms /  1538 tokens (   84.36 ms per token,    11.85 tokens per second)
      total time =  129809.29 ms /  1539 tokens
VERB [              start_loop] update_multitasks | tid="140439775965184" timestamp=1774557499
draft acceptance rate = 0.57330 (  786 accepted /  1371 generated)

without -mtp:

prompt eval time =     600.34 ms /    24 tokens (   25.01 ms per token,    39.98 tokens per second)
       eval time =   62964.02 ms /  1575 tokens (   39.98 ms per token,    25.01 tokens per second)
      total time =   63564.37 ms /  1599 tokens

Overall, decode is about 2 times slower with -mtp.
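The slowdown can be read straight off the two "eval time" lines. A minimal Python sketch, assuming the timing format shown in the logs above (the regular expression and the helper name `decode_tps` are illustrative, not part of llama-server):

```python
import re

# Matches the decode "eval time" line. Anchoring at the start of the line
# excludes the "prompt eval time" line, which begins with "prompt".
EVAL_RE = re.compile(
    r"^[ \t]*eval time\s*=\s*([\d.]+) ms\s*/\s*(\d+) tokens", re.MULTILINE
)


def decode_tps(log_text: str) -> float:
    """Return decode throughput in tokens/s from the first 'eval time' line."""
    match = EVAL_RE.search(log_text)
    if match is None:
        raise ValueError("no 'eval time' line found")
    elapsed_ms, tokens = float(match.group(1)), int(match.group(2))
    return tokens / (elapsed_ms / 1000.0)


# Timing lines copied from the two runs above.
WITH_MTP = """\
prompt eval time =      56.74 ms /     1 tokens (   56.74 ms per token,    17.62 tokens per second)
       eval time =  129752.55 ms /  1538 tokens (   84.36 ms per token,    11.85 tokens per second)
"""
WITHOUT_MTP = """\
prompt eval time =     600.34 ms /    24 tokens (   25.01 ms per token,    39.98 tokens per second)
       eval time =   62964.02 ms /  1575 tokens (   39.98 ms per token,    25.01 tokens per second)
"""

slowdown = decode_tps(WITHOUT_MTP) / decode_tps(WITH_MTP)
print(f"decode slowdown with -mtp: {slowdown:.2f}x")
```

For these two runs the ratio comes out at roughly 2.1.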

SamuelOliveirads · 2026-03-26T21:24:21Z

> Overall, decode is about 2 times slower with -mtp.

Don't worry, one day it will be optimized enough to be worth it (I hope).

magikRUKKOLA · 2026-03-26T22:03:59Z

@SamuelOliveirads

Should I re-try with hybrid inference?

jukofyork · 2026-03-26T23:06:28Z

> @magikRUKKOLA I wasn't able to reproduce the same error with your arguments; the only difference was that I couldn't fully offload such a large model to the GPU. That said, some errors did occur, and they were fixed by the most recent rebase of the branch. Since your first test was done before that, please pull again and retry.
>
> To provide more context, the models that have MTP and support it are GLM 4.5/4.6/4.7 and 5.0. You can pass -mtp with any other model and it will simply be disabled (I used Kimi K2.5 to check whether this logic was causing your crash).
>
> Currently, MTP only supports --draft-max and --draft-p-min.
>
> > On mainline llama.cpp I found the best thing to do is run a sweep of all batch sizes from 1 to 64 and plot them. You often see things in the first 2-8 batch sizes that help tune the draft parameters.
>
> @jukofyork I believe that certain parameters, such as draft-max, draft-min, and p-min, could be optimized, perhaps using a controller that adjusts them based on the hit rate of the speculative model. Since you're running some tests, are there any parameters you'd like me to test?

See the posts in this thread, starting here:

ggml-org/llama.cpp#10466 (comment)

I tried to simplify it to the bare minimum here:

ggml-org/llama.cpp#17034

but nobody seemed interested and mainline llama.cpp speculative decoding logic keeps getting more and more complex, so not really sure if I can revive it now.

The key thing from all my experiments is that you can't really just use a fixed min-p as there are all sorts of weird jumps in the batch costs depending on the backend(s) used, FA optimisations, MMQ thresholds, and so on... You really have to consider the sequence probabilities and batch costs for each batch size to get it working well:

Some kind of adaptive controller would be the next step, but there was pretty much zero interest in that discussion and PR...

I'm also not convinced the current logic is correct:

ggml-org/llama.cpp#10466 (comment)

The code has so many tricky optimisations in it now, but I think you can show that if batch=2 costs more than 2 × batch=1, we should never actually use batch=2. At the time I made that post, though, the code would always try batch=2, even if the single token seen before breaking from the loop had a super low probability.

If you look at the costs for my GLM-4.6 in the graph above, it never makes sense to try batch=2 as it is slower than just running batch=1 twice.
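To make the argument concrete, here is a minimal Python sketch of choosing the batch size from sequence probabilities and measured batch costs. It is not the logic from either linked PR. It assumes that a batch of size k holds one regular token plus k-1 drafted tokens, and that the draft's own token probabilities are usable as acceptance estimates. The function name and all numbers are hypothetical.

```python
from itertools import accumulate
from operator import mul


def best_batch_size(draft_probs, batch_cost_ms):
    """Pick the batch size k that maximizes expected tokens per millisecond.

    draft_probs[i]   : estimated acceptance probability of the (i+1)-th drafted token
    batch_cost_ms[k] : measured cost of one target-model pass over k tokens (k >= 1)
    """
    # prefix[i] = probability that the first i+1 drafted tokens are all accepted.
    prefix = list(accumulate(draft_probs, mul))
    best_k, best_rate = 1, 1.0 / batch_cost_ms[1]
    for k in range(2, len(draft_probs) + 2):
        if k not in batch_cost_ms:
            break  # no measurement for this batch size
        expected_tokens = 1.0 + sum(prefix[: k - 1])
        rate = expected_tokens / batch_cost_ms[k]
        if rate > best_rate:
            best_k, best_rate = k, rate
    return best_k, best_rate


# Hypothetical costs with a jump at batch=2 (more than twice the batch=1 cost).
costs = {1: 40.0, 2: 85.0, 3: 90.0, 4: 95.0}

# Even a near-certain single draft token does not justify batch=2 here.
single_k, _ = best_batch_size([0.99], costs)
# With three confident draft tokens, the larger batch amortizes the jump.
multi_k, _ = best_batch_size([0.9, 0.8, 0.7], costs)
print(single_k, multi_k)
```

Since a batch of two can yield at most two tokens, its rate is bounded by 2 / cost(2), which is below 1 / cost(1) whenever cost(2) exceeds twice cost(1). That is why the first call never selects batch=2, whatever the draft probability.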

SamuelOliveirads · 2026-03-27T01:43:36Z

> Should I re-try with hybrid inference?

@magikRUKKOLA If you want to test whether the GLM5 MTP code works, go ahead, I'd appreciate it. In terms of performance, though, it shouldn't make much of a difference.

> See the posts in this thread, starting here:
>
> ggml-org/llama.cpp#10466 (comment)
>
> I tried to simplify it to the bare minimum here:
>
> ggml-org/llama.cpp#17034

@jukofyork This is great material. I need more time to read through the details, but I'll definitely use it when I start working on this feature. I believe the parameters can be inferred in real time, which allows adapting the settings to the user's needs and use cases. At the end of the session, a snapshot of the current metrics could be provided so that the user can use it as a default in the future if they wish.
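As a rough illustration of that idea (not code from this PR), a controller could track a running acceptance rate and nudge the draft length up or down, then export its state at the end of a session. Everything in this Python sketch is hypothetical: the class name, the thresholds, and the smoothing factor.

```python
from dataclasses import asdict, dataclass


@dataclass
class DraftController:
    """Hypothetical controller that adapts draft_max to the observed acceptance rate."""

    draft_max: int = 4
    min_draft: int = 1
    max_draft: int = 16
    ema_accept: float = 0.5   # running estimate of the acceptance rate
    alpha: float = 0.1        # smoothing factor of the moving average
    raise_above: float = 0.8  # grow the draft when acceptance is high
    lower_below: float = 0.5  # shrink the draft when acceptance is low

    def update(self, accepted: int, drafted: int) -> int:
        """Feed the counts of one speculative step; return the draft_max to use next."""
        if drafted > 0:
            rate = accepted / drafted
            self.ema_accept += self.alpha * (rate - self.ema_accept)
        if self.ema_accept > self.raise_above:
            self.draft_max = min(self.draft_max + 1, self.max_draft)
        elif self.ema_accept < self.lower_below:
            self.draft_max = max(self.draft_max - 1, self.min_draft)
        return self.draft_max

    def snapshot(self) -> dict:
        """Settings and metrics that could be reported at the end of a session."""
        return asdict(self)


controller = DraftController()
for _ in range(50):          # a long run of rejected drafts
    controller.update(accepted=0, drafted=4)
low = controller.draft_max
for _ in range(100):         # a long run of fully accepted drafts
    controller.update(accepted=4, drafted=4)
print(low, controller.snapshot())
```

A real controller would also need the per-batch costs discussed earlier in the thread, since a high acceptance rate alone does not guarantee that a longer draft is faster.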

magikRUKKOLA · 2026-03-27T07:30:44Z

@SamuelOliveirads

GLM5 IQ2_KL --cpu-moe:

without -mtp:

prompt eval time =    1776.41 ms /    24 tokens (   74.02 ms per token,    13.51 tokens per second)
       eval time =  111475.06 ms /  1211 tokens (   92.05 ms per token,    10.86 tokens per second)
      total time =  113251.47 ms /  1235 tokens

with -mtp:

prompt eval time =    1308.86 ms /    24 tokens (   54.54 ms per token,    18.34 tokens per second)
       eval time =  192046.45 ms /  1437 tokens (  133.64 ms per token,     7.48 tokens per second)
      total time =  193355.32 ms /  1461 tokens
VERB [speculative_decoding_accept] speculative decoding result | tid="139636109680640" timestamp=1774596621 id_slot=0 accepted=1 total=0 new_n_past=1461
draft acceptance rate = 0.57460 (  751 accepted /  1307 generated)

SamuelOliveirads · 2026-03-27T13:51:57Z

The performance loss is consistent with my tests, which leads me to believe that the initial gains will be in hybrid/CPU-only inference, but that in the future the main gains will come from the GPU.

SamuelOliveirads · 2026-05-03T23:07:02Z

I did a rebase and the gap has narrowed, but it’s still there:

Benchmark (`GLM 5 --runs 1 --max-tokens 1500`)

| Mode | Code | Extract | Story | Overall | Accept rate |
|---|---|---|---|---|---|
| Baseline | 11.3 t/s (1009 tok) | 11.4 t/s | 11.3 t/s | 11.3 ± 0.0 t/s | N/A |
| MTP `draft-max 1` | 9.7 t/s (832 tok) | 9.9 t/s | 8.9 t/s | 9.5 ± 0.5 t/s | 85.1% ± 12.2% |
| MTP `draft-max 2` | 8.8 t/s (832 tok) | 10.0 t/s | 8.1 t/s | 9.0 ± 0.9 t/s | 59.5% ± 12.8% |

The gap remains between 17% and 22%. The drop in the acceptance rate to about 60% with `draft-max 2` stands out: there may be an issue with the MTP embeddings, and it would be worth testing with `draft-max 3` as well.
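One way to read these numbers is to ask how expensive an MTP step must be, relative to a plain decode step, to produce the measured throughput. The Python sketch below is a back-of-the-envelope estimate and not a measurement. It assumes that every step drafts exactly `n_draft` tokens (so --draft-p-min never cuts a draft short) and that the accept rate is accepted tokens divided by drafted tokens.

```python
def implied_step_cost(baseline_tps: float, mtp_tps: float,
                      accept_rate: float, n_draft: int) -> float:
    """Cost of one MTP step (draft + verify) in units of one baseline decode step.

    Assumes each step yields 1 + n_draft * accept_rate tokens on average.
    """
    tokens_per_step = 1.0 + n_draft * accept_rate
    return tokens_per_step * baseline_tps / mtp_tps


# "Overall" column and accept rates from the benchmark table above.
for n_draft, mtp_tps, accept in [(1, 9.5, 0.851), (2, 9.0, 0.595)]:
    cost = implied_step_cost(11.3, mtp_tps, accept, n_draft)
    breakeven = 1.0 + n_draft * accept
    print(f"draft-max {n_draft}: step cost ~{cost:.2f}x baseline, "
          f"break-even below {breakeven:.2f}x")
```

Under these assumptions, MTP only pays off once a step costs less than the number of tokens it yields on average, which is the break-even figure printed for each row.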

ikawrakow · 2026-05-04T05:07:36Z

+    cur = llm_build_norm(ctx0, cur, hparams, mtp_layer.attn_norm, NULL, LLM_NORM_RMS, cb, il);
+    cb(cur, "attn_norm", il);
+
+    {


Apart from the above construction of the MTP input, is this function just a copy of build_deepseek2 for one layer?

SamuelOliveirads · 2026-05-04

Yes, MTP is a typical one-layer decoder after the inputs.

I reviewed it, and the architecture matches SGLang; only a small fix in the post-layer was missing. I also checked the embeddings just to be sure, and they match. I reran the benchmark after a new rebase, and the performance remains consistent with what I highlighted in my last test. I also tested with --draft-max 3, which yielded 8.1 ± 1.2 t/s overall and a 45.8% ± 12.6% acceptance rate.

SamuelOliveirads added 30 commits February 5, 2026 19:22

Refactors server_slot to support generic speculative decoding (MTP or Draft Model).

b75f70e

core: enable hybrid outputs (logits + embeddings) for MTP support

f9c4f6c

fix(mtp): correct KV-cache slot finding for updates

b61daeb

fix(mtp): persist hidden states to prevent context corruption during drafting

c03ae51

refactor(mtp): clean unused code

ab6f4bb

fix(mtp): update server to new functions name

ec2d1a0

fix(mtp): fix graph and save hidden state

9317463

mtp: refactor integration, context params and kv cache search

d3465f1

mtp: fix hidden state extraction and speculative acceptance flow

2539f4f

server: fix MTP warmup for long prompts and reset token buffer

07e4936

llama: refactor MTP operation state to context parameters

d088faa

server: fix n_past calculation in MTP acceptance

97ec50e

llama: fix mtp enable flags

573170e

Merge branch 'main' into feat-glm-mtp

5260bf2

speculative: refactor MTP to use common_speculative interface

b4a2c88

context: remove unused signatures

b8f27f3

clip: fix deprecated enum-enum conversion warning

dd684fb

common: fix format string crash in help message

0bcee4e

context: fix mtp activation logic

1d5b287

llamat: always use the extracted embedding

1da0758

llama: get all embeddings to kv cache

4d774d0

Merge branch 'main' into fix-mtp-embedding

dc5ee27

llama: revert logit to not run mtp for not supported arch

1ab6327

llama: allocate all the n_outputs for MTP

5eec0d3

wip

301f3db

server-context: get only the last embedding for hidden state

6236fb3

ggml-backend: fix array of bounds in debug build

f548ac1

server-context: run mt kv update to each prompt batch

d53dfc7

revert segmentation fault fixes

94c8184

magikRUKKOLA mentioned this pull request Mar 26, 2026

Bug: Bad tool calling performance on Qwen3.5 35B A3B (aarch64, CPU inference) #1487

Closed

SamuelOliveirads added 2 commits May 2, 2026 20:59

wip

deb13ea

glm-mtp: standardize the MTP graph

767ebca

ikawrakow reviewed May 4, 2026

View reviewed changes

SamuelOliveirads added 5 commits May 4, 2026 14:38

glm 5 mtp: apply post-layer cvec

0ead56d

Merge branch 'main' into feat/glm5-mtp

284d754

glm 5 mtp: mark head as mandatory

b3a3be0

Merge remote-tracking branch 'origin/main' into feat/glm5-mtp

9edded3

get normed embeddings for glm 5

ee51a7a

SamuelOliveirads mentioned this pull request May 18, 2026

Fix Qwen3.6-MoE low MTP acceptance rate #1815

Merged

Merge branch 'main' into feat/glm5-mtp

4646800

ikawrakow mentioned this pull request May 27, 2026

GLM-5 MTP (again) #1890

Merged

SamuelOliveirads closed this May 29, 2026
